Global Open Access COVID-19 Epidemiological Data: What is it? How good is the data quality? And how can we improve it?

1. What is Open Access COVID-19 Epidemiological Data?

Coronavirus disease 2019 (COVID-19) has been spreading rapidly across the globe. Basic aggregate data such as total number of tests, new cases, and deaths have been published regularly and frequently, providing critical information to understand the magnitude, pace, and location of the epidemic. Compared to previous disease outbreaks, we see substantially more publicly accessible data (Dong 2020, WHO 2020) and use of such data by media and researchers. This rapid and transparent information has been extremely valuable for not only the public health authorities but also the general public.

In addition to the aggregate data, a research group published Open Access Epidemiological Data, a centralized global database with information on each individual case, based on reports from WHO, Ministries of Health, and Chinese health authorities. The rapid action to establish and share this open access, machine-readable, regularly updated (about twice a day), de-identified case-level data is beyond commendable. It can answer questions that require information disaggregated by specific background characteristics. The database has 34 variables covering age, sex, sub-national location, presence of underlying chronic illness, travel history, and dates of symptom onset, hospitalization, and laboratory confirmation. Further methodological details about the database were recently published. The research team collaborated with data curators who thoroughly assessed source data from individual countries as well as WHO, and applied advanced data management techniques to create and update the large database. Anybody can view and access it on their GitHub. So, as a data enthusiast, I was thrilled to see it, and I admire the research team's efforts.

2. What does the data quality look like?

As a good public health data scientist, I also started assessing the data quality. Unfortunately, the quality of the case-level data is suboptimal.

First, timeliness, one of the most important requirements for outbreak data, is poor. As of March 31, 6:41 PM EST, a total of 131260 cases from 107 countries had been included in the database. This is less than 20% of the total confirmed cases reported on the same date by other currently available sources. Since there is no easily accessible information on the date of the source data, we cannot assess the actual time elapsed between laboratory confirmation, reporting of the result in the source data, and inclusion of the case in the database.

2.1 Overall completeness of reported age and sex

Second, completeness (i.e., the extent to which an actual and reasonable value is recorded, for example age 57, not 575 or missing) is one of the most important attributes of high-quality data, as analyses based on less complete data can produce biased results. When completeness of age and sex information - the two most basic demographic characteristics in epidemiologic studies - is assessed among the currently available 131260 cases, only 3.9% and 4.1% have complete information for age and sex, respectively.
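To make the definition concrete, here is a minimal sketch of how completeness could be computed. It is not the code behind the numbers above: the field names (`age`, `sex`), the 0-120 plausibility range, and the records are all made up for illustration, and open-ended age groups such as '60+' are simply counted as incomplete here.

```python
def is_complete_age(value):
    """True if value is a plausible age: a single year or a closed range like '20-29'."""
    text = "" if value is None else str(value).strip()
    try:
        bounds = [float(part) for part in text.split("-")]
    except ValueError:
        return False  # missing, non-numeric, or open-ended such as '60+'
    # Assumed plausibility range; 575 fails this check.
    return all(0 <= b <= 120 for b in bounds)


def is_complete_sex(value):
    """True if value is one of the two recorded categories (an assumption)."""
    return str(value).strip().lower() in {"male", "female"}


def completeness(records, field, is_complete):
    """Percentage of records whose field holds an actual and reasonable value."""
    if not records:
        return 0.0
    n_complete = sum(1 for r in records if is_complete(r.get(field)))
    return 100.0 * n_complete / len(records)


# Made-up records, not taken from the database.
cases = [
    {"age": "57", "sex": "female"},
    {"age": "575", "sex": "male"},   # implausible age
    {"age": "20-29", "sex": ""},     # age group reported, sex missing
    {"age": "", "sex": "male"},      # age missing
    {"age": None, "sex": None},      # both missing
]

age_pct = completeness(cases, "age", is_complete_age)
sex_pct = completeness(cases, "sex", is_complete_sex)
print(f"age: {age_pct:.1f}%  sex: {sex_pct:.1f}%")
```

In this toy example two of the five ages and three of the five sex values count as complete.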

Even among cases with complete age, the units used vary substantially between countries, which makes pooled data analysis challenging. Further, age is sometimes reported in units that are too wide for epidemiologic studies. Among cases with age, 66% have age reported in single years, 0% in 5-year age groups, 25% in 10-year age groups, and 8% in age groups wider than 10 years. Open-ended age groups - which are typically used for older populations - are used in 0% of cases with age in the database. Where they are used, however, the groups often cover the general adult population rather than older populations (e.g., ‘20 and above’, ‘18 and above’), or they start at a relatively low threshold for old age (e.g., ‘60 and above’). This is problematic, since mortality varies greatly by age even within that open range of 60 and above. See, for example, age-specific mortality rates in South Korea.
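The classification of age units can be sketched roughly as follows. The string formats ('57', '20-29', '60+') are my assumption for illustration; the actual database may encode age differently (e.g., ‘60 and above’), and the sample values below are made up.

```python
from collections import Counter


def age_unit(age_text):
    """Classify a reported age by the width of its age group.

    Assumed formats: '57' (single year), '20-29' (closed group), '60+' (open-ended).
    Raises ValueError for anything else.
    """
    text = age_text.strip()
    if text.endswith("+"):
        int(text[:-1])  # check that the lower bound is a number
        return "open-ended"
    if "-" in text:
        low, high = (int(part) for part in text.split("-", 1))
        width = high - low + 1  # '20-29' spans 10 years
        if width <= 1:
            return "single year"
        if width <= 5:
            return "5-year group"
        if width <= 10:
            return "10-year group"
        return "wider than 10 years"
    int(text)  # check that the single value is a number
    return "single year"


# Made-up age values, not taken from the database.
ages = ["57", "3", "20-29", "30-34", "18-64", "60+"]
counts = Counter(age_unit(a) for a in ages)
for unit, n in counts.items():
    print(f"{unit}: {100.0 * n / len(ages):.0f}%")
```

A pooled analysis would then have to fall back to the widest unit that it can still interpret, which is why wide and open-ended groups are costly.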

2.2 Completeness of reported age and sex by country

Across countries, data completeness varies substantially. Figure 1 shows completeness of age and sex reporting by country among the 36 countries that have 30 or more confirmed cases in this database. For both age and sex, completeness ranges from 0% to 100%, with Singapore at 100%. There is no clear relationship between the total number of cases and completeness (e.g., the fewer the cases, the higher the completeness) (Figure 2).
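A country-level breakdown like this one could be computed along the following lines. This is only a sketch: the field names and records are hypothetical, and for brevity a value counts as complete whenever it is non-empty, which is looser than the definition of completeness used in this post.

```python
from collections import defaultdict


def has_value(value):
    """Simplified completeness check: anything non-empty counts."""
    return value is not None and str(value).strip() != ""


def completeness_by_country(records, field, min_cases=30):
    """Return {country: (number of cases, % complete)} for countries with enough cases."""
    groups = defaultdict(list)
    for record in records:
        groups[record["country"]].append(record)

    result = {}
    for country, rows in groups.items():
        if len(rows) < min_cases:
            continue  # too few cases for a meaningful percentage
        n_complete = sum(1 for r in rows if has_value(r.get(field)))
        result[country] = (len(rows), 100.0 * n_complete / len(rows))
    return result


# Made-up records: country A has 40 cases (10 with age), B has 30 (all with age),
# and C has only 5 cases, so it is dropped by the 30-case threshold.
records = (
    [{"country": "A", "age": "40" if i < 10 else ""} for i in range(40)]
    + [{"country": "B", "age": "33"} for _ in range(30)]
    + [{"country": "C", "age": "50"} for _ in range(5)]
)

by_country = completeness_by_country(records, "age")
for country, (n, pct) in sorted(by_country.items()):
    print(f"{country}: {n} cases, {pct:.0f}% with age")
```

Plotting the two numbers in each tuple against each other gives the kind of scatter shown in Figure 2.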


Figure 1. Completeness of age and sex reporting by country (among 36 countries with 30 or more confirmed cases in the database)

Figure 2. Relationship between completeness of reporting and the number of cases (among 36 countries with 30 or more confirmed cases in the database)

2.3 Completeness of reported age and sex by country: How has it changed over time?

More interestingly, cumulative completeness for the first 30, 60, 90, 120, and 150 confirmed cases also varied greatly between countries as they moved into crisis mode (Figures 3.A and 3.B).
* Singapore, Canada, Mexico, and Australia have very high completeness for both sex and age among the first 30 as well as throughout the first 150 cases.
* However, in most countries, data completeness decreased as the number of cases increased. In South Korea, for example, completeness was initially very high, exceeding 90%, but dropped rapidly over the first 150 cases, potentially because the number of cases increased exponentially over a very short period (the total number of cases increased from 30 to 156 over 5 days) (KCDC 2020).
* The pattern in the US and Germany is unique in that data completeness improved over the first 90 cases, though it declined later, as reflected in the completeness among the total cases in the database.
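The cumulative measure behind these comparisons can be sketched as below. I assume here that a country's cases are already sorted by confirmation date and reduced to a list of True/False flags (True = the value is complete); the flags in the example are made up to mimic a country where completeness starts high and then drops.

```python
def cumulative_completeness(flags, cutoffs=(30, 60, 90, 120, 150)):
    """Completeness (%) among the first n cases, for each cutoff n.

    flags: booleans in order of confirmation, True where the value is complete.
    Cutoffs larger than the number of cases are skipped.
    """
    result = {}
    for n in cutoffs:
        if len(flags) < n:
            break
        result[n] = 100.0 * sum(flags[:n]) / n
    return result


# Made-up pattern: the first 30 cases are complete, the next 120 are not.
flags = [True] * 30 + [False] * 120
trend = cumulative_completeness(flags)
for n, pct in trend.items():
    print(f"first {n} cases: {pct:.0f}% complete")
```

Because the measure is cumulative, a drop from 100% to 50% between the first 30 and the first 60 cases means that none of cases 31 to 60 had a complete value.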

Even in Canada, which had high completeness throughout the first 150 cases, the overall completeness is now low for all cases currently included in the database (1044 as of March 31). In Singapore, Mexico, and Australia, 153, 839, and 271 cases are included in the database, respectively, and the overall completeness remains high for these countries.


Figure 3.A. Completeness of age reporting for the first 30, 60, 90, 120 and 150 confirmed cases by country (among 29 countries with 150 or more confirmed cases in the database)

Figure 3.B. Completeness of sex reporting for the first 30, 60, 90, 120 and 150 confirmed cases by country (among 29 countries with 150 or more confirmed cases in the database)
Note: Countries are sorted in descending order of completeness for the first 30 cases and then of completeness for the first 150 cases. All cases refer to all cases currently included in the database. China is excluded from this figure, since it was the first country affected by the epidemic.

3. What can we do to improve the database?

Patient care and epidemic control are the top priorities in the middle of a crisis. Understandably, countries can be delayed in sharing case-level data with WHO and other countries, even while more detailed and complete data are available and shared within each individual country - see examples in China, South Korea, and the US. However, in rapidly unfolding pandemic outbreaks, sharing data across countries is critical. Having open access, good-quality case-level data will help us understand the global pandemic and develop the best strategies to combat it.

A streamlined and coordinated epidemic surveillance data system should be developed and used by countries. The database system should include a limited number of essential variables that are critical for developing rapid and specific interventions. For data collection, data entry tools should be developed to minimize human error while ensuring that the units of data are useful for research and programmatic responses. Importantly, the system should improve processes for reporting and sharing case-level data with WHO as well as the public to optimize timeliness. Without high-quality source data from individual countries, efforts to establish a global database have only limited value. After this immediate crisis is over, this should be a top priority for WHO, member countries, and the global health data community.
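As one illustration of what a data entry check might look like, here is a rough sketch that validates a record at the point of entry. Everything in it is hypothetical: the field names (`age_years`, `sex`, `date_confirmed`), the allowed values, and the rules are my assumptions, not part of any existing system.

```python
from datetime import date

# Hypothetical set of allowed values for the sex field.
ALLOWED_SEX = {"female", "male", "unknown"}


def validate_case(record):
    """Return a list of error messages; an empty list means the record passes."""
    errors = []

    # Age in single years keeps the unit useful for pooled analysis.
    age = record.get("age_years")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age_years must be a whole number between 0 and 120")

    if record.get("sex") not in ALLOWED_SEX:
        errors.append("sex must be one of: " + ", ".join(sorted(ALLOWED_SEX)))

    # One unambiguous date format avoids day/month confusion across countries.
    try:
        date.fromisoformat(record.get("date_confirmed"))
    except (TypeError, ValueError):
        errors.append("date_confirmed must be an ISO date (YYYY-MM-DD)")

    return errors


good = {"age_years": 57, "sex": "female", "date_confirmed": "2020-03-31"}
bad = {"age_years": 575, "sex": "", "date_confirmed": "31/03/2020"}

print(validate_case(good))
for message in validate_case(bad):
    print(message)
```

Rejecting a value like 575 at entry is far cheaper than trying to repair it after the data have been pooled across countries.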

  • Last updated on: 2020-03-31
  • For typos, errors, and questions, contact me at www.isquared.global

Making Data Delicious, One Byte at a Time, in good times and bad times.